Add Short and BigDecimal type mappings to SqlTypeUtils#763
Open
gknz wants to merge 3 commits into
Open
Conversation
Short and java.math.BigDecimal had no entry in the JDBC type map, so columns of those Java types were created as VARCHAR(255) by the JDBC table sink in overwrite mode. Map Short (and short) to SMALLINT and BigDecimal to NUMERIC; the PostgreSQL dialect map inherits both from the default map. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
SqlTypeUtils now maps Short->SMALLINT and BigDecimal->NUMERIC for the generated DDL, but the sinks still bound those values via the String fallback. Complete the write path: - JavaTableSink: bind via setShort / setBigDecimal - SparkTableSink: map Short->ShortType and BigDecimal->DecimalType(38,18) so the Spark DataFrame writes SMALLINT / NUMERIC columns instead of text
Contributor
|
Hi thanks for the contribution, if possible it would also be nice to have a unit test for this |
Contributor
|
@mspruc what unit test do you have in mind? This PR seems to simply add two more data types |
Contributor
|
@zkaoudi well something that tests the sinks ideally, e.g. for https://github.com/gknz/wayang/blob/efb35979111a6bb84192acd2ade8424bcd1ee463/wayang-platforms/wayang-spark/src/main/java/org/apache/wayang/spark/operators/SparkTableSink.java#L156 we have two magic numbers, I'd like to make sure this conversion is stable. |
Add unit and end-to-end tests pinning the Short/BigDecimal handling across the table sinks, and fix a decimal-truncation issue the tests surfaced. - SqlTypeUtils: map BigDecimal to NUMERIC(38,18) instead of a bare NUMERIC. A bare NUMERIC defaults to scale 0 on most engines (H2, MySQL, SQL Server, Derby, ...) and silently truncates the fractional part; an explicit scale fixes this on every database and matches Spark's DecimalType default (38, 18). - SqlTypeUtilsTest: cover the full default type map (including Short and BigDecimal) and confirm the PostgreSQL dialect inherits them. - SparkTableSink: make getSparkDataType package-private so it can be unit-tested. - SparkTableSinkTest: assert the Java-class to Spark DataType mapping (including BigDecimal to DecimalType(38,18) and the String fallback), plus an end-to-end write of all supported types to H2 verifying values and SMALLINT/NUMERIC(38,18). - JavaTableSinkTest: end-to-end write of all supported types to H2 verifying values and column types.
Author
|
@mspruc hey! I've added unit tests for sqlTypeUtils, SparkTableSink, and JavaTableSink. Please feel free to read the updated descriptions for all the changes/additions I went with and why. If you have any comments, please feel free to let me know! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds proper SQL type support for
Shortandjava.math.BigDecimalacross thetable-sink write path. Previously neither had a JDBC type mapping, so the sink
created such columns as
VARCHAR(255)(numbers stored as text, with no numericordering/aggregation in the DB). This PR maps them to real numeric SQL types and
binds their values correctly on both the Java and Spark execution paths.
Changes
1. Type mapping (
SqlTypeUtils)Short/short→SMALLINTBigDecimal→NUMERIC(38,18)2. Value binding (the DDL type alone is not enough)
Mapping the column type without binding the value would bind a string into a
numeric column. Each in-JVM sink is fixed:
JavaTableSink: bind viaPreparedStatement.setShort(...)/setBigDecimal(...)instead of the
setStringfallback.SparkTableSink: mapShort → DataTypes.ShortTypeandBigDecimal → DataTypes.createDecimalType(38, 18), so the DataFrame schema(and the JDBC write derived from it) produces
SMALLINT/NUMERIC(38,18).Why
NUMERIC(38,18)rather than a bareNUMERICA bare
NUMERICis not portable for decimals. Per the SQL standard, omitttingthe scale means scale 0, so most engines create an integer column and silently
truncate the fractional part: H2, MySQL (
DECIMAL(10,0)), SQL Server(
DECIMAL(18,0)), Derby, DB2 all do this (e.g.12.345 → 12). PostgreSQL is thelenient exception (bare
NUMERICis unbounded), which can hide the bug. Anexplicit scale avoids the truncation on every database.
Why specifically
38and18These are not arbitrary:
(38, 18)is Spark'sDecimalType.SYSTEM_DEFAULT—the precision/scale Spark assigns to a decimal when none is otherwise known
(
38is also the maximum precision Spark'sDecimalsupports, and a value widelysupported by Oracle/SQL Server). We deliberately reuse the same values on the
Java/DDL path (
SqlTypeUtils → NUMERIC(38,18)):38= max total significant digits (generous headroom),18= digits after the decimal point (ample for monetary/rate data),types for a
BigDecimal. In Wayang the optimizer decides which platform runsthe sink, so the output schema must not depend on that choice — matching Spark's
default keeps both paths consistent.
Trade-off: a fixed scale pads short values with trailing zeros
(
12.345 → 12.345000000000000000). This is numerically identical and is exactlywhat the Spark path already produced;=, in return decimals aree portable and lossless
across all databases.
Tests
SqlTypeUtilsTest: the full default type map (incl.Short → SMALLINT,BigDecimal → NUMERIC(38,18)) and PostgreSQL inheritance.SparkTableSinkTest:getSparkDataType(...)— the whole Java-class →Spark
DataTypetable, theStringfallback, and explicitlyBigDecimal → DecimalType(precision=38, scale=18);round-tripped values and the created column types (
SMALLINT,NUMERIC(38,18)).JavaTableSinkTest: the analogous end-to-end test for the Java sink.All tests run against in-memory H2 — no external database required.
Notes
PostgresTableSinkOperator, etc.) are unaffected :they compose
CREATE TABLE ... AS SELECTand let the DB infer types, so theynever bind a Java value or use
SqlTypeUtils.getSparkDataTypewas widened fromprivateto package-private only to makethe mapping unit-testable.